Executive Summary ¶

This analysis has two objectives:

(1) predict production levels for the next four quarters via a time series analysis.
(2) predict whether a respondent is or is not in the 1975 labor market via machine learning classification models trained on a number of economic and demographic features.

First, this analysis provides the following predictions for production levels over the next four quarters (in our dataset, 2010 Q3 – 2011 Q2) (powered by an ARIMA model and Facebook's Prophet):

Credit is given to Jason Brownlee for incredible explanations of Time Series ML.

Second, this analysis correctly picked which respondent would be in the workforce with 80.79% accuracy (74.6% on cross-validation).

The winning model was a Logistic Regression Classification, followed closely by an XGBoost Classification, followed by a Keras Neural Network Classifier (powered by Google’s TensorFlow).

From a predictive standpoint, the logistic regression classifier found the following features to be most impactful in predicting whether a respondent (in this study, a heterosexual married woman) was in the 1975 labor force. The following is in order of biggest log-odds impact:

federal marginal tax rate facing woman (proxy for wealth)
- increase here = lower labor force likelihood
husband's hourly wage, 1975
- increase here = lower labor force likelihood
hours worked by husband, 1975
- increase here = lower labor force likelihood
number of kids under 6 years old
- increase here = lower labor force likelihood
age
- increase here = lower labor force likelihood
years of experience
- increase here = higher labor force likelihood
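
Since the ranking above is by log-odds impact, it may help to recall how a logistic regression coefficient maps to an odds ratio. The following is a minimal sketch; the coefficient values and the helper name `odds_ratio` are made up for illustration and are not the fitted values from this analysis.

```python
import math

# Hypothetical standardized coefficients, for illustration only;
# these are NOT the fitted values from this analysis.
example_coefs = {"kids under 6": -0.9, "years of experience": 0.7}

def odds_ratio(coef):
    """Multiplicative change in the odds for a one-unit increase in the feature."""
    return math.exp(coef)

for name, coef in example_coefs.items():
    direction = "lower" if coef < 0 else "higher"
    print(f"{name}: odds x {odds_ratio(coef):.2f} ({direction} labor force likelihood)")
```

A negative coefficient gives an odds ratio below 1 (lower likelihood), and a positive one gives a ratio above 1 (higher likelihood), which is the directionality reported in the list.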

From a decision-tree splitting perspective, XGBoost found the following features to be most important (though these numbers have no positive or negative 'sign,' so they do not indicate directionality). The following is in descending order of importance (most important first):

years of experience
number of kids under 6 years old
number of kids between 6 and 18 years old
federal marginal tax rate facing woman (proxy for wealth)
mother's years of schooling
husband's age
mother's age
years of schooling

Credit to God, my Mother, family and friends.

All errors are my own.

Best,
George John Jordan Thomas Aquinas Hayward, Optimist

Table of Contents ¶

Executive Summary
Selected Data Visualizations
Key Assumptions and Plan of Attack
- Part 0. Load in Dependencies
Problem I: Time Series Forecasting
- Part 1. ARIMA Model
- Part 2. Facebook Prophet Model
  - Additional: Combining The Models All on One Graph
Problem II: Machine Learning Classification
Thank You!

Selected Data Visualizations ¶

Time Series ¶

Time Series Autocorrelation Graph

ARIMA Prediction

Facebook Prophet Prediction

ARIMA vs. FB Prophet Prediction

ARIMA vs. FB Prophet Prediction Numbers

Classification ¶

Feature Pair Plot

Feature Correlation Matrix

Logistic Regression Coefficient Interpretation

XGBoost Feature Importance

Logistic Regression Confusion Matrix

Back to Top

Key Assumptions & Plan of Attack ¶

Time Series

This prediction is only for a “1-4 quarter horizon,” which, for this data set, would be 1 year.
I check for trend (stationary vs. non-stationary) and seasonality (the cyclical nature of the ups and downs in production). I also think about what kind of differencing may be needed.
- A lot of this can be seen from a plot of the raw data.
I use an autocorrelation plot to see what kind of lag we would want to use in models.
I built an ARIMA model (AutoRegressive Integrated Moving Average):
- The ARIMA model takes 3 parameters:
  - p = number of lag observations needed (the lookback for the autoregression)
  - d = the degree of differencing
  - q = the size of the moving average window
I can tune these parameters in a GridSearch-like procedure.
I also built a Facebook Prophet time series model:
- From Facebook’s development site:
  - “Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.”
  - FB Prophet works automatically, and requires fewer parameters.
Some data scientists prefer ARIMA models because of how customizable they are, but I also wanted to see if the faster FB Prophet model would produce comparable results, which it did.
I evaluated the ARIMA model on an RMSE (root mean squared error) basis.
I then plotted the FB Prophet model against the ARIMA model.
The models are largely in agreement.
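
To make the differencing parameter `d` from the list above concrete, here is a small sketch on a made-up quarterly series (a linear trend plus a repeating four-quarter pattern). The series and the helper `difference` are assumptions for illustration, not the production data. The lag-4 line is only there to show the seasonal analogue; the ARIMA order used in this analysis only has the plain `d`.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Toy quarterly series: upward trend plus a repeating 4-quarter pattern.
seasonal_pattern = [30, -10, -20, 0]
toy = [100 + 2 * t + seasonal_pattern[t % 4] for t in range(12)]

first_diff = difference(toy, lag=1)     # d = 1: removes the linear trend, keeps the seasonal swings
seasonal_diff = difference(toy, lag=4)  # lag 4: removes the quarterly pattern, leaves the yearly growth

print(first_diff)
print(seasonal_diff)
```

On this toy series the lag-4 difference is constant, because the only thing left after removing the seasonal pattern is four quarters of trend.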

Classification

I opted to use three different classification techniques:
- (1) Logistic Regression Classifier
- (2) XGBoost Classifier
- (3) Keras Neural Network Classifier (with Google’s TensorFlow)
Features were generally continuous, and the data was generally in great shape.
In future iterations of this analysis, more features could be engineered (similar to the existing experience ^ 2 feature).
I eliminated features that were not predictive because they were not truly independent from the dependent variable:
- For instance, ‘wage’ and ‘lwage’ must be 0 for those outside the labor force and must be greater than 0 for those in the workforce. As such, these features are not really predictive.
- Similarly, ‘faminc’ and 'nwifeinc' both take into account the wage of the respondent (the woman in the household), so they also give away whether the respondent is or is not in the workforce. I eliminated them as well.
  - In other words, they include, though not perfectly, the information contained in the ‘wage’ feature, which would taint the predictive power of the model.
- Finally, 'repwage' is the wage reported in 1976; since we're predicting labor force status in 1975, it should be taken out.
All models were cross-validated, and the final cross validation accuracies were as follows:
- Logistic Regression Classification: 74.6%
- XGBoost Classification: 71.3%
- Keras Neural Network: 71.3%
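
For readers less familiar with cross-validation, the idea behind the accuracies above can be sketched in plain Python. This is a simplified, unstratified k-fold with a trivial majority-class baseline on made-up labels; the function names are hypothetical, and scikit-learn's `cross_val_score` (used later in the notebook) handles the splitting and scoring itself.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

def majority_class_accuracy(labels, train_idx, test_idx):
    """Score a trivial baseline: always predict the most common training label."""
    train_labels = [labels[i] for i in train_idx]
    guess = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == guess for i in test_idx) / len(test_idx)

# Toy labels (1 = in labor force), illustration only.
labels = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
scores = [majority_class_accuracy(labels, tr, te)
          for tr, te in k_fold_indices(len(labels), 5)]
cv_accuracy = sum(scores) / len(scores)  # the reported number is this kind of average
print(scores, cv_accuracy)
```

Each row is used for testing exactly once, and the reported cross-validation accuracy is the mean of the per-fold scores.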

Back to Top

Part 0. Load in Dependencies ¶

This section loads the dependencies for the ARIMA model, Facebook Prophet, XGBoost Classification, Logistic Regression Classification, and the Keras Neural Network Classification Model (with Google's TensorFlow). Scikit-learn's Grid Search is also loaded, along with Matplotlib.

#for time series work
import altair as alt
alt.renderers.enable('notebook')
import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'retina'
import pandas as pd
pd.set_option('display.max_columns', None)
from pandas import DataFrame
import numpy as np
from pandas.plotting import autocorrelation_plot
from statsmodels.tsa.arima_model import ARIMA
from math import sqrt
from sklearn.metrics import mean_squared_error
from fbprophet import Prophet
import warnings
warnings.filterwarnings("ignore") #just for the final
#for regression classification work
import seaborn as sns; sns.set()
import missingno as msno
from scipy import stats
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn import linear_model
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, r2_score, median_absolute_error, \
explained_variance_score, confusion_matrix, accuracy_score, precision_score, recall_score
import xgboost as xgb
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold

Back to Top

Part I. Time Series Analysis ¶

The plan is to use an ARIMA model and a Facebook Prophet model.

A.) Forecasting

Analyze the “production” time series data in the provided file and choose a forecasting model that provides reasonable forecasts at a 1-4 quarter horizon. In addition to including and showing (through code output, visuals, or both) the selected forecasting model, please include descriptions of the following:

How did you decide on the specific forecasting model? What tests or plots did you use when considering various forecasting approaches?
How accurate is your model? How did you test its performance?

Please note: this exercise is meant to get a general sense of how you think about forecasting problems - it isn’t intended for you to find the “global optimal” forecast methodology for the provided data series. Accordingly, please limit your total time spent on the exercise to approximately one hour.

Back to Top

📈 ARIMA Model 📈¶

#first, we load in the data and set up an ARIMA focused dataframe
df_arima = pd.read_csv('data_A.csv')
#we need to get the string quarter naming convention into a datetime friendly format
#I've opted to use January 1 to represent the first quarter, April 1 for the second, July 1 for the third, 
#and October 1 for the fourth...this ensures our data points have the correct 3-month interval cadence
df_arima.time = df_arima.time.str.replace(' Q1', '-01-01', regex=False)
df_arima.time = df_arima.time.str.replace(' Q2', '-04-01', regex=False)
df_arima.time = df_arima.time.str.replace(' Q3', '-07-01', regex=False)
df_arima.time = df_arima.time.str.replace(' Q4', '-10-01', regex=False)
df_arima['time'] = pd.to_datetime(df_arima['time'],format='%Y-%m-%d')
df_arima.head()

#now plotting the same thing more properly
plt.plot(df_arima.time, df_arima.production)
plt.show()

👆🏽Thoughts about this? 👆🏽¶

We can see some seasonality, as the curve goes up and down with some regularity.
The series is not stationary: there is a visible trend from 1960 to 1975 and, again, another trend from 1980 to 2010.

#let's check the autocorrelation to see how many periods back our autoregression should track
autocorrelation_plot(df_arima.production)
plt.title("Autocorrelation Graph", fontweight = 'bold')
plt.savefig('autocorrelation.png',dpi=300, bbox_inches='tight')
plt.show()

👆🏽Thoughts about this? 👆🏽¶

We see a significant correlation up to around 25 periods, and an extremely high, positive correlation when we look at the first 5 periods.
- This makes sense to me because there are only 4 quarters in a year, and I would think that the last year (so last 4 periods) would be perhaps the most helpful when thinking about the next four quarters.
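
To illustrate what the autocorrelation plot measures, here is a small sketch that computes the sample autocorrelation by hand on a made-up quarterly series with no trend. The series and the helper `autocorrelation` are assumptions for illustration. With a purely seasonal pattern, the correlation peaks at lags that are multiples of 4, which is the kind of structure to look for when picking a lookback.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation at the given lag (the usual ACF estimate)."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum((series[t] - mean) * (series[t + lag] - mean)
                     for t in range(n - lag))
    return covariance / variance

# Toy quarterly series: a flat level with a repeating 4-quarter pattern.
seasonal_pattern = [30, -10, -20, 0]
toy = [400 + seasonal_pattern[t % 4] for t in range(40)]

for lag in range(1, 9):
    print(lag, round(autocorrelation(toy, lag), 3))
```

The real production series also has a trend, which is one reason its autocorrelation stays high over many more lags than this toy example.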

X = df_arima.production.values
size = int(len(X) * 0.66)
train, test = X[0:size], X[size:len(X)]
history = [x for x in train]
predictions_arima = []

# walk-forward validation
for t in range(len(test)):
    model = ARIMA(history, order=(6,1,1)) 
    model_fit = model.fit(disp=0)
    output = model_fit.forecast()
    yhat = output[0]
    predictions_arima.append(yhat)
    obs = test[t]
    history.append(obs)
    print('predicted=%f, expected=%f' % (yhat, obs))

predicted=583.361869, expected=574.000000
predicted=469.884850, expected=443.000000
predicted=422.122631, expected=410.000000
predicted=428.541688, expected=420.000000
predicted=563.170100, expected=532.000000
predicted=439.575125, expected=433.000000
predicted=392.686590, expected=421.000000
predicted=409.343245, expected=410.000000
predicted=532.900785, expected=512.000000
predicted=439.102902, expected=449.000000
predicted=410.926546, expected=381.000000
predicted=415.435590, expected=423.000000
predicted=505.971137, expected=531.000000
predicted=437.016598, expected=426.000000
predicted=397.945416, expected=408.000000
predicted=426.781951, expected=416.000000
predicted=526.411806, expected=520.000000
predicted=441.402115, expected=409.000000
predicted=400.478497, expected=398.000000
predicted=415.083399, expected=398.000000
predicted=509.045246, expected=507.000000
predicted=405.230612, expected=432.000000
predicted=387.455530, expected=398.000000
predicted=406.855965, expected=406.000000
predicted=514.267067, expected=526.000000
predicted=434.329681, expected=428.000000
predicted=407.837323, expected=397.000000
predicted=417.402220, expected=403.000000
predicted=520.466045, expected=517.000000
predicted=431.045554, expected=435.000000
predicted=393.860797, expected=383.000000
predicted=407.807633, expected=424.000000
predicted=514.185726, expected=521.000000
predicted=436.695440, expected=421.000000
predicted=395.823406, expected=402.000000
predicted=423.849517, expected=414.000000
predicted=519.588853, expected=500.000000
predicted=430.609009, expected=451.000000
predicted=390.220780, expected=380.000000
predicted=420.725200, expected=416.000000
predicted=507.246522, expected=492.000000
predicted=441.080057, expected=428.000000
predicted=384.992543, expected=408.000000
predicted=408.082253, expected=406.000000
predicted=491.043855, expected=506.000000
predicted=440.463537, expected=435.000000
predicted=404.738565, expected=380.000000
predicted=422.091693, expected=421.000000
predicted=496.871507, expected=490.000000
predicted=430.317010, expected=435.000000
predicted=387.047701, expected=390.000000
predicted=414.949080, expected=412.000000
predicted=495.536671, expected=454.000000
predicted=438.672501, expected=416.000000
predicted=378.230516, expected=403.000000
predicted=396.894481, expected=408.000000
predicted=453.905157, expected=482.000000
predicted=422.256805, expected=438.000000
predicted=407.626210, expected=386.000000
predicted=429.602944, expected=405.000000
predicted=481.401355, expected=491.000000
predicted=430.930299, expected=427.000000
predicted=391.689579, expected=383.000000
predicted=409.369527, expected=394.000000
predicted=484.187923, expected=473.000000
predicted=427.008078, expected=420.000000
predicted=375.742417, expected=390.000000
predicted=390.939274, expected=410.000000
predicted=472.662500, expected=488.000000
predicted=428.574274, expected=415.000000
predicted=399.863735, expected=398.000000
predicted=416.609365, expected=419.000000
predicted=485.179996, expected=488.000000
predicted=423.665198, expected=414.000000
predicted=401.149464, expected=374.000000

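#let's fit the ARIMA model
#the parameters used here are the result of a grid search I ran in a previous analysis
#note: the full-series fit is commented out, so the summary below describes model_fit
#as left by the last walk-forward iteration above
#model = ARIMA(df_arima.production, order=(6,1,1))
#model_fit = model.fit(disp=0)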

print(model_fit.summary())

                             ARIMA Model Results                              
==============================================================================
Dep. Variable:                    D.y   No. Observations:                  216
Model:                 ARIMA(6, 1, 1)   Log Likelihood                -917.664
Method:                       css-mle   S.D. of innovations             16.626
Date:                Sat, 31 Aug 2019   AIC                           1853.327
Time:                        23:30:02   BIC                           1883.705
Sample:                             1   HQIC                          1865.600
                                                                              
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.7911      0.555      1.425      0.156      -0.297       1.879
ar.L1.D.y     -1.3591      0.142     -9.544      0.000      -1.638      -1.080
ar.L2.D.y     -1.1343      0.160     -7.088      0.000      -1.448      -0.821
ar.L3.D.y     -0.6633      0.169     -3.927      0.000      -0.994      -0.332
ar.L4.D.y      0.3042      0.164      1.860      0.064      -0.016       0.625
ar.L5.D.y      0.6690      0.101      6.592      0.000       0.470       0.868
ar.L6.D.y      0.4357      0.062      6.974      0.000       0.313       0.558
ma.L1.D.y      0.3537      0.157      2.246      0.026       0.045       0.662
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.3435           -0.0000j            1.3435           -0.0000
AR.2           -0.0020           -1.0062j            1.0062           -0.2503
AR.3           -0.0020           +1.0062j            1.0062            0.2503
AR.4           -1.0252           -0.0000j            1.0252           -0.5000
AR.5           -0.9250           -0.8890j            1.2829           -0.3782
AR.6           -0.9250           +0.8890j            1.2829            0.3782
MA.1           -2.8273           +0.0000j            2.8273            0.5000
-----------------------------------------------------------------------------

#we check to see if there are any odd patterns in the residuals over the time series
residuals = DataFrame(model_fit.resid) 
residuals.plot()
plt.show()

#we continue to check residuals
residuals.plot(kind='kde') 
plt.show()

rmse = sqrt(mean_squared_error(test, predictions_arima)) 
print('Test RMSE: %.3f' % rmse)

Test RMSE: 15.996
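
As a worked example of the RMSE calculation, the sketch below applies the formula by hand to the first four (predicted, expected) pairs printed in the walk-forward output above. It only covers four of the test points, so the result is not expected to equal the full test RMSE.

```python
from math import sqrt

# First four (predicted, expected) pairs copied from the walk-forward output above.
pairs = [
    (583.361869, 574.0),
    (469.884850, 443.0),
    (422.122631, 410.0),
    (428.541688, 420.0),
]

def rmse_from_pairs(pairs):
    """Root mean squared error: square the errors, average them, take the root."""
    squared_errors = [(predicted - expected) ** 2 for predicted, expected in pairs]
    return sqrt(sum(squared_errors) / len(squared_errors))

print('RMSE on these four points: %.3f' % rmse_from_pairs(pairs))
```

Because the errors are squared before averaging, the single large miss on the second pair dominates the result.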

#time to gridsearch for the best ARIMA model #this code comes from Jason Brownlee
# evaluate an ARIMA model for a given order (p,d,q)
def evaluate_arima_model(X, arima_order):
    # prepare training dataset
    train_size = int(len(X) * 0.66)
    train, test = X[0:train_size], X[train_size:]
    history = [x for x in train]
    # make predictions
    predictions_eval = []
    for t in range(len(test)):
        model = ARIMA(history, order=arima_order)
        model_fit = model.fit(disp=0)
        yhat = model_fit.forecast()[0]
        predictions_eval.append(yhat)
        history.append(test[t])
    # calculate out of sample error
    rmse = sqrt(mean_squared_error(test, predictions_eval))
    return rmse

# evaluate combinations of p, d and q values for an ARIMA model
def evaluate_models(dataset, p_values, d_values, q_values):
    dataset = dataset.astype('float32')
    best_score, best_cfg = float("inf"), None
    for p in p_values:
        for d in d_values:
            for q in q_values:
                order = (p,d,q)
                try:
                    rmse = evaluate_arima_model(dataset, order)
                    if rmse < best_score:
                        best_score, best_cfg = rmse, order
                    print('ARIMA%s RMSE=%.3f' % (order,rmse))
                except Exception:
                    continue
    print('Best ARIMA%s RMSE=%.3f' % (best_cfg, best_score))

# evaluate parameters
p_values = [3,4,5,6,7]
d_values = range(0, 3)
q_values = range(0, 3)
evaluate_models(df_arima.production.values, p_values, d_values, q_values)

ARIMA(3, 0, 0) RMSE=50.645
ARIMA(3, 0, 1) RMSE=35.438
ARIMA(3, 0, 2) RMSE=26.444
ARIMA(3, 1, 0) RMSE=17.807
ARIMA(3, 1, 1) RMSE=17.850
ARIMA(3, 2, 0) RMSE=24.093
ARIMA(3, 2, 2) RMSE=16.793
ARIMA(4, 0, 1) RMSE=16.417
ARIMA(4, 1, 0) RMSE=17.877
ARIMA(4, 1, 1) RMSE=16.924
ARIMA(4, 2, 1) RMSE=16.877
ARIMA(4, 2, 2) RMSE=16.893
ARIMA(5, 0, 0) RMSE=16.427
ARIMA(5, 1, 0) RMSE=17.606
ARIMA(5, 1, 1) RMSE=17.024
ARIMA(5, 2, 1) RMSE=64.289
ARIMA(6, 0, 0) RMSE=17.086
ARIMA(6, 1, 0) RMSE=16.269
ARIMA(6, 1, 1) RMSE=15.996
ARIMA(7, 0, 0) RMSE=16.105
Best ARIMA(6, 1, 1) RMSE=15.996
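
The grid above has 5 × 3 × 3 = 45 candidate orders, but only 20 results were printed, so the other 25 fits raised an exception and were skipped silently. A possible variation, sketched below, records the failures so they can be inspected. The scorer `fake_score` is a hypothetical stand-in; in the notebook it would be something like `lambda order: evaluate_arima_model(X, order)`.

```python
from itertools import product

def grid_search(score_fn, p_values, d_values, q_values):
    """Score every (p, d, q) order; keep failures instead of discarding them."""
    scores, failures = {}, {}
    for order in product(p_values, d_values, q_values):
        try:
            scores[order] = score_fn(order)
        except Exception as exc:  # record why the fit failed
            failures[order] = repr(exc)
    best_order = min(scores, key=scores.get) if scores else None
    return best_order, scores, failures

# Stand-in scorer (hypothetical): fails for some orders, like a fit that does not converge.
def fake_score(order):
    p, d, q = order
    if d == 1 and q == 2:
        raise ValueError("did not converge")
    return abs(p - 6) + abs(d - 1) + abs(q - 1)

best, scores, failures = grid_search(fake_score, [3, 4, 5, 6, 7], range(3), range(3))
print(best, len(scores), len(failures))
```

Keeping the failure messages makes it easier to tell whether an order was skipped because of a convergence problem or because of a genuine bug.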

#4 more predictions
#recursive multi-step forecast: each prediction is appended to history and treated as if observed
predictions_four_more_arima = []
for t in range(4):
    model = ARIMA(history, order=(6,1,1)) 
    model_fit = model.fit(disp=0)
    output = model_fit.forecast()
    yhat = output[0]
    predictions_four_more_arima.append(yhat)
    history.append(yhat)
    print('predicted=%f' % (yhat))

predicted=419.813278
predicted=480.280402
predicted=410.967236
predicted=379.157988
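
The loop above forecasts one step at a time and feeds each prediction back into the history. The same recursive strategy can be sketched without statsmodels, using a toy one-step forecaster (the average of the same quarter over the last three years). Both functions and the toy history are assumptions for illustration, not the ARIMA model.

```python
def seasonal_mean_forecast(history, season_length=4, n_seasons=3):
    """Toy one-step forecaster: average of the same quarter in the last few years."""
    same_quarter = history[-season_length::-season_length][:n_seasons]
    return sum(same_quarter) / len(same_quarter)

def recursive_forecast(history, steps, one_step_fn):
    """Forecast `steps` ahead by feeding each prediction back in as if observed."""
    extended = list(history)  # copy, so the caller's history is not modified
    forecasts = []
    for _ in range(steps):
        yhat = one_step_fn(extended)
        forecasts.append(yhat)
        extended.append(yhat)
    return forecasts

# Three made-up years of quarterly values.
toy_history = [500, 430, 400, 410, 496, 428, 396, 408, 492, 426, 392, 406]
print(recursive_forecast(toy_history, 4, seasonal_mean_forecast))
```

One thing to keep in mind with this strategy is that errors can compound, since later forecasts are built on earlier forecasts rather than on observed values.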

#key for us is what are the next four quarters going to be like?
for i in range(4):
    print(predictions_four_more_arima[i])

[419.81327788]
[480.28040207]
[410.96723569]
[379.15798845]

#we can add those four predictions to a dataframe, and concatenate it to the original
#the values are taken from predictions_four_more_arima rather than typed in by hand,
#so the dataframe always matches the forecasts printed above
pred_array_arima = {'time': ['2010-07-01', '2010-10-01', '2011-01-01', '2011-04-01'],
                    'production': [float(p[0]) for p in predictions_four_more_arima]}
pred_df_arima = pd.DataFrame(data=pred_array_arima)
pred_df_arima

combined_future_arima = pd.concat([df_arima,pred_df_arima], axis=0, ignore_index=True)
combined_future_arima['time'] = pd.to_datetime(combined_future_arima['time'],format='%Y-%m-%d')

plt.plot(df_arima.time, df_arima.production, label="Actual")
plt.plot(combined_future_arima.time[-5:], combined_future_arima.production[-5:], label = 'ARIMA Predicted')
plt.suptitle("Production Historicals & Predictions: \n Time Series Analysis (ARIMA)",\
             fontsize = 12, fontweight = 'bold')
plt.xlabel("Year")
plt.ylabel("Production")
#plt.xticks(np.arange(20, 220, step=38), ('1960', '1970', '1980', '1990', '2000','2010'))
plt.legend()
plt.savefig('arima_prediction.png',dpi=300, bbox_inches='tight')
plt.show()

Back to Top

📈 Facebook Prophet Model 📈¶

df_prophet = pd.read_csv('data_A.csv')
df_prophet.time = df_prophet.time.str.replace(' Q1', '-01-01', regex=False)
df_prophet.time = df_prophet.time.str.replace(' Q2', '-04-01', regex=False)
df_prophet.time = df_prophet.time.str.replace(' Q3', '-07-01', regex=False)
df_prophet.time = df_prophet.time.str.replace(' Q4', '-10-01', regex=False)
#need to get the format and naming conventions correct for Facebook Prophet
df_prophet['time'] = pd.to_datetime(df_prophet['time'],format='%Y-%m-%d')
df_prophet = df_prophet.rename(columns={"time": "ds", "production": "y"})
df_prophet.head()

m = Prophet(weekly_seasonality=False, daily_seasonality=False)
m.fit(df_prophet)

<fbprophet.forecaster.Prophet at 0x1c1f1d3c50>

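#periods=365 with the default daily frequency adds 365 daily rows after the last observation,
#which is enough to reach the four quarter-start dates looked up below
future = m.make_future_dataframe(periods=365)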

forecast = m.predict(future)
forecast[['ds', 'yhat']].tail()
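
The forecast above is evaluated on daily dates, while only four quarter-start dates are needed. The sketch below generates just those dates with the standard library; the helper `next_quarter_starts` is hypothetical. Prophet's `make_future_dataframe` also accepts a `freq` argument, so requesting four quarter-start periods may be a lighter alternative, though I have not re-run the notebook that way.

```python
from datetime import date

def next_quarter_starts(last_quarter_start, n):
    """Return the n quarter-start dates that follow last_quarter_start."""
    year, month = last_quarter_start.year, last_quarter_start.month
    starts = []
    for _ in range(n):
        month += 3
        if month > 12:
            month -= 12
            year += 1
        starts.append(date(year, month, 1))
    return starts

# The last observed quarter in this dataset is 2010 Q2, encoded as 2010-04-01.
horizon = next_quarter_starts(date(2010, 4, 1), 4)
print([d.isoformat() for d in horizon])
```

The last of these dates is exactly 365 days after 2010-04-01, which is why a 365-day daily horizon happens to cover all four quarters.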

prophet_predictor_times = ['2010-07-01', '2010-10-01', '2011-01-01', '2011-04-01']

dates = []
prophet_predictions = []
for i in prophet_predictor_times:
    dates.append(i)
    prophet_predictions.append(forecast.yhat[forecast.ds == i].values)

for i in prophet_predictions:
    print(i)

[387.42722424]
[485.30002099]
[417.43768147]
[371.68171404]

pred_array_prophet = {'ds': ['2010-07-01', '2010-10-01', '2011-01-01', '2011-04-01'], 'y': [387.42722424,\
                                                        485.30002099,417.43768147,371.68171404]}
pred_df_prophet  = pd.DataFrame(data=pred_array_prophet )
pred_df_prophet

combined_future_prophet = pd.concat([df_prophet,pred_df_prophet], axis=0, ignore_index=True)
combined_future_prophet['ds'] = pd.to_datetime(combined_future_prophet['ds'],format='%Y-%m-%d')

plt.plot(df_prophet.ds, df_prophet.y, label="Actual")
plt.plot(combined_future_prophet.ds[-5:], combined_future_prophet.y[-5:], label = 'FB Prophet Predicted')
plt.suptitle("Production Historicals & Predictions: \n Time Series Analysis (Facebook Prophet)",\
             fontsize = 12, fontweight = 'bold')
plt.xlabel("Year")
plt.ylabel("Production")
plt.legend()
plt.savefig('fb_prophet_prediction.png',dpi=300, bbox_inches='tight')
plt.show()

Back to Top

📈 Combining The Models All on One Graph 📈¶

plt.plot(df_prophet.ds, df_prophet.y, label="Actual")
plt.plot(combined_future_prophet.ds[-5:], combined_future_prophet.y[-5:], label = 'FB Prophet Predicted')
plt.plot(combined_future_arima.time[-5:], combined_future_arima.production[-5:], label = 'ARIMA Predicted')
plt.suptitle("Production Historicals and Predictions \n Time Series Analysis (ARIMA & Facebook Prophet)",\
             fontsize = 12, fontweight = 'bold')
plt.xlabel("Year")
plt.ylabel("Production")
plt.legend()
plt.savefig('all_models.png',dpi=300, bbox_inches='tight')
plt.show()

#put the ARIMA and FB Prophet forecasts side by side for comparison
#the values are taken from the prediction lists above rather than typed in by hand
pred_array_arima_and_prophet = {'Quarter': ['2010-07-01', '2010-10-01', '2011-01-01', '2011-04-01'],\
                            'ARIMA Predictions': [float(p[0]) for p in predictions_four_more_arima],\
                            'FB Prophet Predictions': [float(p[0]) for p in prophet_predictions]
                               }
pred_df_arima_prophet = pd.DataFrame(data=pred_array_arima_and_prophet)
pred_df_arima_prophet

Back to Top

Part II. Classification Analysis ¶

The plan is to run a logistic regression classification, an XGBoost classification, and a Keras Neural Network classification.

B.) Regression/Machine Learning

Use the data provided to create a model that predicts labor force participation (inlf variable in the dataset).
You are free to use any combination of the other variables for this prediction.
Here is a list of the included variables and their descriptions.

In addition to describing the selected model, please describe how you chose the model and how you tested its effectiveness in predicting the variable of interest.

Back to Top

Exploratory Data Analysis ¶

labordata = pd.read_csv("data_B.csv")

labordata.head()

msno.matrix(labordata,  color = (.0,.0,.2))

<matplotlib.axes._subplots.AxesSubplot at 0x1c1e0d0278>

👆🏽Thoughts about this? 👆🏽¶

Using this null data visualization, we can immediately see that there is something up with the 'wage' and 'lwage' columns.
This makes sense:
- If you are making a wage or thus a log-transformed wage, then you must have been working in 1975. Thus, this data point is not actually predictive, and I'll soon take it out of the dataset.

#missing data
total = labordata.isnull().sum().sort_values(ascending=False)
percent = (labordata.isnull().sum()/labordata.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head()

#dropping features we do not want in the analysis
labordata_processed = labordata.drop(['lwage','wage', 'repwage','hours',\
                                     'faminc','nwifeinc'], axis = 1)
#each of these features is basically a proxy for already being in the workforce, so they aren't predictors
#repwage happens in 1976, but we're predicting for 1975 so that has to be out
#family income and nwifeinc are backdoors for the income of the woman, who would then already be in the workforce

👆🏽Thoughts about this? 👆🏽¶

Each of these features is basically a proxy for already being in the workforce, so they aren't predictors.
Along those lines, 'faminc' and 'nwifeinc' both take into account the wage of the person being studied, so they also give away whether the respondent is or is not in the workforce. I've eliminated them as well.
'repwage' is the wage reported in 1976; since we're predicting labor force status in 1975, it should be taken out.
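
The leakage argument can be checked with a few lines of code. In the sketch below, on made-up rows where the wage is only observed for women who worked, a rule that just looks at whether a wage exists scores perfectly. That is a sign the feature encodes the outcome rather than predicting it. The rows, the values, and the helper `rule_accuracy` are all hypothetical.

```python
# Toy rows (hypothetical values): wage is only observed for women who worked.
rows = [
    {"wage": 3.35, "kidslt6": 0, "inlf": 1},
    {"wage": None, "kidslt6": 2, "inlf": 0},
    {"wage": 4.10, "kidslt6": 1, "inlf": 1},
    {"wage": None, "kidslt6": 0, "inlf": 0},
    {"wage": 2.80, "kidslt6": 0, "inlf": 1},
    {"wage": None, "kidslt6": 1, "inlf": 0},
]

def rule_accuracy(rows, rule):
    """Share of rows where a simple 0/1 rule matches the inlf label."""
    return sum(rule(r) == r["inlf"] for r in rows) / len(rows)

leaky = rule_accuracy(rows, lambda r: int(r["wage"] is not None))
honest = rule_accuracy(rows, lambda r: int(r["kidslt6"] == 0))
print(leaky, honest)
```

A feature that reaches 100% accuracy with a one-line rule is a good candidate for removal before modeling.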

msno.matrix(labordata_processed,  color = (.0,.0,.2))

<matplotlib.axes._subplots.AxesSubplot at 0x1c206285f8>

#all credit due to: Pedro Marcelino 
#https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python
#correlation matrix
corrmat = labordata_processed.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)
plt.savefig('labor_data_correlation_matrix.png',dpi=300, bbox_inches='tight')

👆🏽Thoughts about this? 👆🏽¶

Correlation plots are great for seeing which kinds of features 'move' with each other.
One interesting thing to see is the correlation between age and husband's age.
- It looks like people tend to marry partners of around the same age.

#all credit due to: Pedro Marcelino 
#https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python
#scatterplot
sns.set()
#cols = ['column1', 'column2']
sns.pairplot(labordata_processed, height = 2.5) #'size' was renamed to 'height' in seaborn 0.9
#plt.savefig('exhibit_0b_pair_plot.png',dpi=600, bbox_inches='tight')
#600 dpi is commented out. It's useful for viewing all the data up close on your desktop, but slows down the script.
plt.savefig('labor_data_pair_plot_lower_res.png',dpi=150, bbox_inches='tight')
plt.show()

👆🏽Thoughts about this? 👆🏽¶

Pair plots are great for quickly scanning all the data and seeing relationships.
- We must remember that they don't take into account interactions.

#address skew in features
#For this block, credit goes to Alexandru Papiu 
#(https://www.kaggle.com/apapiu/regularized-linear-models)
#log transform skewed numeric features:
continous_features_classification = ['kidslt6', 'kidsge6', 'age', 'educ', 'hushrs', 'husage',\
'huseduc', 'huswage', 'mtr', 'motheduc', 'fatheduc', 'unem', 'exper', 'expersq'] 
#exposure units have already been taken out
skewed_feats = labordata_processed[continous_features_classification].apply(\
                                                                                lambda x: stats.skew(x)) 
#compute skewness
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index

labordata_processed[skewed_feats] = np.log1p(labordata_processed[skewed_feats])
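
To show what the skew threshold and the `log1p` transform do, here is a small sketch on a made-up right-skewed column. The helper `skewness` computes the biased third standardized moment, which I understand to be the default of `scipy.stats.skew`; the toy values are an assumption.

```python
import math

def skewness(values):
    """Biased sample skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Toy right-skewed column (most values small, one large), illustration only.
toy = [0, 0, 0, 1, 1, 2, 3, 8]
logged = [math.log1p(v) for v in toy]  # log(1 + x), safe for zeros

print(round(skewness(toy), 3), round(skewness(logged), 3))
```

The toy column starts above the 0.75 threshold used in the cell above, and the transform pulls the long right tail in, so its skewness drops.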

Back to Top

📊 XGBoost Classification 📊¶

#xgboost
classification_features_xgb = labordata_processed.drop(['inlf'], axis = 1)
classification_outcome_xgb = labordata_processed.inlf
train_features, test_features, train_labels, test_labels = train_test_split(classification_features_xgb,\
                                                                    classification_outcome_xgb, test_size = 0.2)

xgb_classy = xgb.XGBClassifier()

xgb_classy.fit(train_features,train_labels)

xgb_classy_predictions = xgb_classy.predict(test_features)


#run model
print("_________XGBoost Regression Classification_________")
print("")
print("Scored Against Itself")
print('Accuracy Score: {}'.format(round(xgb_classy.score(train_features, train_labels),3)))
print("")
print("Scored Against Test Data")
print('Accuracy Score: {}'.format(round(xgb_classy.score(test_features, test_labels),3)))

_________XGBoost Regression Classification_________

Scored Against Itself
Accuracy Score: 0.887

Scored Against Test Data
Accuracy Score: 0.748

#cross-val #this will run about 70 seconds per print
print("_________Cross-Validation Scoring for XGBoost Classification_________")
print('Accuracy: {}'.format(round(cross_val_score(xgb_classy, train_features, train_labels, \
                                                                cv=10, scoring='accuracy').mean(),3)))
print('Precision: {}'.format(round(cross_val_score(xgb_classy, train_features, train_labels, \
                                                                cv=10, scoring='precision').mean(),3)))
print('Recall: {}'.format(round(cross_val_score(xgb_classy, train_features, train_labels, \
                                                                cv=10, scoring='recall').mean(),3)))

_________Cross-Validation Scoring for XGBoost Classification_________
Accuracy: 0.713
Precision: 0.739
Recall: 0.769

top_10_xgb_features = pd.DataFrame(sorted(list(zip(classification_features_xgb,xgb_classy.feature_importances_))\
       ,key = lambda x: abs(x[1]),reverse=True)[:10], columns=['Feature', 'XGBoost Importance'])
top_10_xgb_features

#plt.xticks(rotation=-25)
bar_count = range(len(top_10_xgb_features.Feature))
fig, axs = plt.subplots(ncols=2, figsize=(14,4))
#using a subplot method coupled with an inline parameter to have high resolution
#note: "[::-1]" reverses the column in a pandas dataframe
axs[1].set_axis_off()
axs[0].barh(bar_count, top_10_xgb_features['XGBoost Importance'][::-1],\
                 align='center', alpha=1)
axs[0].set_xlabel('Values')
axs[0].set_yticks(bar_count)
axs[0].set_yticklabels(top_10_xgb_features.Feature[::-1], fontsize=10)
axs[0].set_xlabel('XGBoost Importance')
axs[0].set_title("XGBoost's Feature Importances",fontweight = 'bold')

extent = axs[0].get_window_extent().transformed(fig.dpi_scale_trans.inverted())
fig.savefig('laborforce_xgbfeatures',dpi=300, bbox_inches=extent.expanded(1.5, 1.5))
plt.show()

xgb_classy_predictions = xgb_classy.predict(test_features) #redundant, but copied here too so I can see it
xgb_cm = confusion_matrix(test_labels, xgb_classy_predictions)
#print(cm) #this is the barebones confusion matrix

#all credit due to: Michael Galarnyk, "Logistic Regression using Python (scikit-learn)", Towards Data Science 
plt.figure(figsize=(9,9))
sns.heatmap(xgb_cm, annot=True, fmt=".0f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual Label');
plt.xlabel('Predicted Label');
all_sample_title = 'XGBoost Classification \n Accuracy Score: {0:.3f}'.format(\
                                                                xgb_classy.score(test_features, test_labels))
plt.title(all_sample_title, size = 15)
plt.savefig('exhibit_h_xgboost_regression_confusion_matrix',dpi=300, bbox_inches='tight')
print("Accuracy: "+str(accuracy_score(test_labels, xgb_classy_predictions))) #this is just a little check at the end
print("Precision: "+str(precision_score(test_labels, xgb_classy_predictions))) #this is just a little check at the end
print("Recall: "+str(recall_score(test_labels, xgb_classy_predictions))) #this is just a little check at the end

Accuracy: 0.7483443708609272
Precision: 0.7608695652173914
Recall: 0.813953488372093
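
To make the three scores above concrete, here is a small sketch of how accuracy, precision, and recall follow from the four cells of a confusion matrix. The counts are an assumption: they are not read from the notebook's heatmap but inferred as whole numbers consistent with the printed scores.

```python
# Confusion-matrix cells (assumed: inferred from the printed scores, not read from the heatmap)
true_pos, false_pos = 70, 22
false_neg, true_neg = 16, 43

total = true_pos + false_pos + false_neg + true_neg

# Share of all predictions that are correct
accuracy = (true_pos + true_neg) / total
# Of the respondents predicted to be in the labor force, the share that really are
precision = true_pos / (true_pos + false_pos)
# Of the respondents actually in the labor force, the share the model found
recall = true_pos / (true_pos + false_neg)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
```

Precision and recall share the same numerator (true positives) and differ only in the denominator, which is why a model can score well on one and poorly on the other.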

📊 Logistic Regression Classification 📊¶

#logistic regression
classification_features = labordata_processed.drop(['inlf'], axis = 1)
classification_outcome = labordata_processed.inlf
train_features, test_features, train_labels, test_labels = train_test_split(classification_features,\
                                                                            classification_outcome, test_size = 0.2)
#standardize the features, since sklearn's LogisticRegression applies L2 regularization by default
scaler = StandardScaler()
#"To determine the scaling factors and apply the scaling to the feature data:" -Codecademy
classification_train_features = scaler.fit_transform(train_features)
#"To apply the scaling to the test data:" -Codecademy
classification_test_features = scaler.transform(test_features) #we do NOT want to fit to the test

#run model
log_model = LogisticRegression(solver="liblinear") #to remove warning
log_model.fit(classification_train_features, train_labels)
print("_________Logistic Regression Classification_________")
print("")
print("Scored Against Itself")
print('Accuracy Score: {}'.format(round(log_model.score(classification_train_features, train_labels),3)))
print("")
print("Scored Against Test Data")
print('Accuracy Score: {}'.format(round(log_model.score(classification_test_features, test_labels),3)))

_________Logistic Regression Classification_________

Scored Against Itself
Accuracy Score: 0.771

Scored Against Test Data
Accuracy Score: 0.808
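
The scaling step earlier in this section fits the scaler on the training features only and reuses those statistics for the test features. A minimal sketch of the same idea using only the standard library; the numbers and variable names are made up for illustration and are not from the labor dataset.

```python
import statistics

# Made-up feature values, purely for illustration
train_values = [10.0, 12.0, 14.0, 16.0, 18.0]
test_values = [11.0, 20.0]

# "fit": learn the mean and standard deviation from the training data only
train_mean = statistics.mean(train_values)
train_std = statistics.pstdev(train_values)  # population std, the convention StandardScaler uses

# "transform": apply the training statistics to both sets
scaled_train = [(v - train_mean) / train_std for v in train_values]
scaled_test = [(v - train_mean) / train_std for v in test_values]

print(scaled_train)
print(scaled_test)
```

The scaled training values have mean 0 and standard deviation 1 by construction; the scaled test values generally do not, which is expected, because the test set contributed nothing to the statistics.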

#cross-val
print("_________Cross-Validation Scoring for Logistic Regression Classification_________")
print('Accuracy: {}'.format(round(cross_val_score(log_model, classification_train_features, train_labels, \
                                                                cv=10, scoring='accuracy').mean(),3)))
print('Precision: {}'.format(round(cross_val_score(log_model, classification_train_features, train_labels, \
                                                                cv=10, scoring='precision').mean(),3)))
print('Recall: {}'.format(round(cross_val_score(log_model, classification_train_features, train_labels, \
                                                                cv=10, scoring='recall').mean(),3)))

_________Cross-Validation Scoring for Logistic Regression Classification_________
Accuracy: 0.746
Precision: 0.761
Recall: 0.803

#coefficients are per one-standard-deviation change, since the features were standardized above
log_regression_feature_list = list(classification_features.columns)
log_regression_coef_list = list(log_model.coef_[0])
#odds ratio = exp(coefficient); percent change in odds = (odds ratio - 1) * 100
odds_ratios = [np.exp(coef) for coef in log_regression_coef_list]
percent_change_in_odds = [round((ratio - 1) * 100, 2) for ratio in odds_ratios]

top_10_log_reg_features = pd.DataFrame(sorted(list(zip(log_regression_feature_list,log_regression_coef_list, \
                                                       percent_change_in_odds))\
       ,key = lambda x: abs(x[1]),reverse=True)[:10], columns=['Feature', 'Logistic Regression Coefficient',\
                                                              'Percent Change in Odds of Being in Labor Force'])
top_10_log_reg_features
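
The conversion from coefficient to percent change in odds can be checked by hand. This sketch takes two coefficients from the reported logistic regression table (mtr and exper) and recomputes their odds ratios. Because the features were standardized, each coefficient describes a one-standard-deviation increase in the feature, not a one-unit increase.

```python
import math

# Coefficients copied from the reported logistic regression coefficient table
coefficients = {"mtr": -1.249254, "exper": 0.417370}

for feature, coef in coefficients.items():
    odds_ratio = math.exp(coef)                # multiplicative change in the odds
    percent_change = (odds_ratio - 1) * 100    # the same change expressed as a percentage
    print(feature, round(odds_ratio, 4), round(percent_change, 2))
```

An odds ratio below 1 (mtr) lowers the odds of being in the labor force, and one above 1 (exper) raises them. The percentages apply to odds, not to probabilities.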

log_model_predictions = log_model.predict(classification_test_features)
cm = confusion_matrix(test_labels, log_model_predictions)
#print(cm) #this is the barebones confusion matrix

#all credit due to: Michael Galarnyk, "Logistic Regression using Python (scikit-learn)", Towards Data Science 
plt.figure(figsize=(9,9))
sns.heatmap(cm, annot=True, fmt=".0f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual Label');
plt.xlabel('Predicted Label');
all_sample_title = 'Logistic Regression Classification \n Accuracy Score: {0:.3f}'.format(\
                                                log_model.score(classification_test_features, test_labels))
plt.title(all_sample_title, size = 15)
plt.savefig('exhibit_g_logistic_regression_confusion_matrix',dpi=300, bbox_inches='tight')
print("Accuracy: "+str(accuracy_score(test_labels, log_model_predictions))) #this is just a little check at the end
print("Precision: "+str(precision_score(test_labels, log_model_predictions))) #this is just a little check at the end
print("Recall: "+str(recall_score(test_labels, log_model_predictions))) #this is just a little check at the end

Accuracy: 0.8079470198675497
Precision: 0.8152173913043478
Recall: 0.8620689655172413

📊 Keras Neural Network Classification 📊¶

# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

#keras neural network classification
classification_features_nn = labordata_processed.drop(['inlf'], axis = 1)
classification_outcome_nn = labordata_processed.inlf
train_features_nn, test_features_nn, train_labels_nn, test_labels_mm = train_test_split(classification_features_nn,\
                                                                classification_outcome_nn, test_size = 0.2)

#standardize the features; neural networks generally train better on scaled inputs
scaler = StandardScaler()
#"To determine the scaling factors and apply the scaling to the feature data:" -Codecademy
classification_train_features_nn = scaler.fit_transform(train_features_nn)
#"To apply the scaling to the test data:" -Codecademy
classification_test_features_nn = scaler.transform(test_features_nn) #we do NOT want to fit to the test

def create_model(optimizer='adam', init='glorot_uniform'):
    # create model; the defaults keep the original settings (Adam, Keras' default Glorot-uniform init)
    model = Sequential()
    model.add(Dense(12, input_dim=15, kernel_initializer=init, activation='relu'))
    model.add(Dense(15, kernel_initializer=init, activation='relu'))
    model.add(Dense(1, kernel_initializer=init, activation='sigmoid'))
    # Compile model with the optimizer argument, so a grid search over it actually has an effect
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

warnings.filterwarnings("ignore",category=DeprecationWarning)
# create model
model = KerasClassifier(build_fn=create_model, epochs=150, batch_size=10, verbose=0)
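#fit on the scaled training split made for the neural network, not the logistic regression split
model.fit(classification_train_features_nn, train_labels_nn, epochs=150, batch_size=10)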
# evaluate using 10-fold cross validation
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(model, classification_train_features_nn, train_labels_nn, cv=kfold)
print(results.mean())

0.7112035788444553

print("_________Cross-Validation Scoring for Keras Neural Network Classification_________")
print('Accuracy: {}'.format(round(cross_val_score(model, classification_train_features_nn, train_labels_nn, \
                                                                cv=10, scoring='accuracy').mean(),3)))
print('Precision: {}'.format(round(cross_val_score(model, classification_train_features_nn, train_labels_nn, \
                                                                cv=10, scoring='precision').mean(),3)))
print('Recall: {}'.format(round(cross_val_score(model, classification_train_features_nn, train_labels_nn, \
                                                                cv=10, scoring='recall').mean(),3)))

_________Cross-Validation Scoring for Keras Neural Network Classification_________
Accuracy: 0.713
Precision: 0.737
Recall: 0.797

model.fit(classification_train_features_nn, train_labels_nn, epochs=150, batch_size=10)

<keras.callbacks.History at 0x1c85b68630>

#keras
model_predictions_nn = model.predict(classification_test_features_nn)
#score against the labels from the neural network's own split (test_labels_mm),
#not test_labels, which belongs to the logistic regression split
cm = confusion_matrix(test_labels_mm, model_predictions_nn)
#print(cm) #this is the barebones confusion matrix

#all credit due to: Michael Galarnyk, "Logistic Regression using Python (scikit-learn)", Towards Data Science 
plt.figure(figsize=(9,9))
sns.heatmap(cm, annot=True, fmt=".0f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual Label');
plt.xlabel('Predicted Label');
all_sample_title = 'Keras Neural Network Classification \n Accuracy Score: {0:.3f}'.format(\
                                                model.score(classification_test_features_nn, test_labels_mm))
plt.title(all_sample_title, size = 15)
plt.savefig('neural_network_confusion_matrix',dpi=300, bbox_inches='tight')
print("Accuracy: "+str(accuracy_score(test_labels_mm, model_predictions_nn))) #this is just a little check at the end
print("Precision: "+str(precision_score(test_labels_mm, model_predictions_nn))) #this is just a little check at the end
print("Recall: "+str(recall_score(test_labels_mm, model_predictions_nn))) #this is just a little check at the end

Accuracy: 0.41721854304635764
Precision: 0.49382716049382713
Recall: 0.45977011494252873

👆🏽Thoughts about this? 👆🏽¶

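I believe this accuracy is so much lower than usual mainly because of a label mismatch rather than the test group selection: the predictions come from the neural network's own test split, but the scores above were computed against test_labels from the logistic regression split, so predictions and labels refer to different respondents.
The cross-validation accuracy, which scores matching rows, is much higher (>70%).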
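
One check that applies whenever a test score falls well below the cross-validation score is whether the predictions and the labels come from the same split. The sketch below uses simulated labels and a simulated classifier that is right about 80% of the time (both are assumptions, not the labor data) to show that scoring predictions from one random test split against labels from a different split pulls accuracy toward chance.

```python
import random

random.seed(7)
n_rows, n_test = 750, 150

# Simulated ground truth and a simulated classifier that is right about 80% of the time
labels = [random.randint(0, 1) for _ in range(n_rows)]
predictions = [y if random.random() < 0.8 else 1 - y for y in labels]

# Two independent random test splits over the same rows
split_a = random.sample(range(n_rows), n_test)
split_b = random.sample(range(n_rows), n_test)

def accuracy(pred_rows, label_rows):
    """Share of positions where the prediction matches the label."""
    hits = sum(predictions[p] == labels[l] for p, l in zip(pred_rows, label_rows))
    return hits / len(pred_rows)

matched = accuracy(split_b, split_b)     # predictions and labels from the same split
mismatched = accuracy(split_b, split_a)  # predictions from one split, labels from another

print("Matched:", matched)
print("Mismatched:", mismatched)
```

Both label lists have the same length, so nothing fails at run time; the only symptom of the mismatch is a score near 50%.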

👇🏾Thoughts about this? 👇🏾¶

The block of code below comes from Jason Brownlee.

warnings.filterwarnings("ignore",category=DeprecationWarning)
# grid search epochs, batch size and optimizer
optimizers = ['rmsprop', 'adam']
inits = ['glorot_uniform', 'normal', 'uniform']
epochs = [50, 100, 150]
batches = [5, 10, 20]
param_grid = dict(optimizer=optimizers, epochs=epochs, batch_size=batches, init=inits)
grid = GridSearchCV(estimator=model, param_grid=param_grid)
grid_result = grid.fit(classification_train_features_nn, train_labels_nn) #scaled neural-network training split
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.636213 using {'batch_size': 5, 'epochs': 50, 'init': 'uniform', 'optimizer': 'rmsprop'}
0.566445 (0.052454) with: {'batch_size': 5, 'epochs': 50, 'init': 'glorot_uniform', 'optimizer': 'rmsprop'}
0.528239 (0.079805) with: {'batch_size': 5, 'epochs': 50, 'init': 'glorot_uniform', 'optimizer': 'adam'}
0.471761 (0.079805) with: {'batch_size': 5, 'epochs': 50, 'init': 'normal', 'optimizer': 'rmsprop'}
0.528239 (0.079805) with: {'batch_size': 5, 'epochs': 50, 'init': 'normal', 'optimizer': 'adam'}
0.636213 (0.032536) with: {'batch_size': 5, 'epochs': 50, 'init': 'uniform', 'optimizer': 'rmsprop'}
0.471761 (0.079805) with: {'batch_size': 5, 'epochs': 50, 'init': 'uniform', 'optimizer': 'adam'}
0.476744 (0.081397) with: {'batch_size': 5, 'epochs': 100, 'init': 'glorot_uniform', 'optimizer': 'rmsprop'}
0.558140 (0.085022) with: {'batch_size': 5, 'epochs': 100, 'init': 'glorot_uniform', 'optimizer': 'adam'}
0.584718 (0.101911) with: {'batch_size': 5, 'epochs': 100, 'init': 'normal', 'optimizer': 'rmsprop'}
0.438538 (0.058214) with: {'batch_size': 5, 'epochs': 100, 'init': 'normal', 'optimizer': 'adam'}
0.471761 (0.079805) with: {'batch_size': 5, 'epochs': 100, 'init': 'uniform', 'optimizer': 'rmsprop'}
0.498339 (0.042208) with: {'batch_size': 5, 'epochs': 100, 'init': 'uniform', 'optimizer': 'adam'}
0.471761 (0.079805) with: {'batch_size': 5, 'epochs': 150, 'init': 'glorot_uniform', 'optimizer': 'rmsprop'}
0.433555 (0.052454) with: {'batch_size': 5, 'epochs': 150, 'init': 'glorot_uniform', 'optimizer': 'adam'}
0.574751 (0.093553) with: {'batch_size': 5, 'epochs': 150, 'init': 'normal', 'optimizer': 'rmsprop'}
0.581395 (0.063342) with: {'batch_size': 5, 'epochs': 150, 'init': 'normal', 'optimizer': 'adam'}
0.476744 (0.081397) with: {'batch_size': 5, 'epochs': 150, 'init': 'uniform', 'optimizer': 'rmsprop'}
0.558140 (0.120625) with: {'batch_size': 5, 'epochs': 150, 'init': 'uniform', 'optimizer': 'adam'}
0.433555 (0.052454) with: {'batch_size': 10, 'epochs': 50, 'init': 'glorot_uniform', 'optimizer': 'rmsprop'}
0.528239 (0.079805) with: {'batch_size': 10, 'epochs': 50, 'init': 'glorot_uniform', 'optimizer': 'adam'}
0.523256 (0.081397) with: {'batch_size': 10, 'epochs': 50, 'init': 'normal', 'optimizer': 'rmsprop'}
0.476744 (0.081397) with: {'batch_size': 10, 'epochs': 50, 'init': 'normal', 'optimizer': 'adam'}
0.438538 (0.058214) with: {'batch_size': 10, 'epochs': 50, 'init': 'uniform', 'optimizer': 'rmsprop'}
0.433555 (0.052454) with: {'batch_size': 10, 'epochs': 50, 'init': 'uniform', 'optimizer': 'adam'}
0.566445 (0.052454) with: {'batch_size': 10, 'epochs': 100, 'init': 'glorot_uniform', 'optimizer': 'rmsprop'}
0.498339 (0.100967) with: {'batch_size': 10, 'epochs': 100, 'init': 'glorot_uniform', 'optimizer': 'adam'}
0.523256 (0.081397) with: {'batch_size': 10, 'epochs': 100, 'init': 'normal', 'optimizer': 'rmsprop'}
0.476744 (0.081397) with: {'batch_size': 10, 'epochs': 100, 'init': 'normal', 'optimizer': 'adam'}
0.586379 (0.056357) with: {'batch_size': 10, 'epochs': 100, 'init': 'uniform', 'optimizer': 'rmsprop'}
0.566445 (0.052454) with: {'batch_size': 10, 'epochs': 100, 'init': 'uniform', 'optimizer': 'adam'}
0.528239 (0.079805) with: {'batch_size': 10, 'epochs': 150, 'init': 'glorot_uniform', 'optimizer': 'rmsprop'}
0.529900 (0.078042) with: {'batch_size': 10, 'epochs': 150, 'init': 'glorot_uniform', 'optimizer': 'adam'}
0.438538 (0.058214) with: {'batch_size': 10, 'epochs': 150, 'init': 'normal', 'optimizer': 'rmsprop'}
0.476744 (0.081397) with: {'batch_size': 10, 'epochs': 150, 'init': 'normal', 'optimizer': 'adam'}
0.438538 (0.058214) with: {'batch_size': 10, 'epochs': 150, 'init': 'uniform', 'optimizer': 'rmsprop'}
0.476744 (0.081397) with: {'batch_size': 10, 'epochs': 150, 'init': 'uniform', 'optimizer': 'adam'}
0.549834 (0.109166) with: {'batch_size': 20, 'epochs': 50, 'init': 'glorot_uniform', 'optimizer': 'rmsprop'}
0.523256 (0.081397) with: {'batch_size': 20, 'epochs': 50, 'init': 'glorot_uniform', 'optimizer': 'adam'}
0.544851 (0.078925) with: {'batch_size': 20, 'epochs': 50, 'init': 'normal', 'optimizer': 'rmsprop'}
0.471761 (0.079805) with: {'batch_size': 20, 'epochs': 50, 'init': 'normal', 'optimizer': 'adam'}
0.433555 (0.052454) with: {'batch_size': 20, 'epochs': 50, 'init': 'uniform', 'optimizer': 'rmsprop'}
0.471761 (0.079805) with: {'batch_size': 20, 'epochs': 50, 'init': 'uniform', 'optimizer': 'adam'}
0.534884 (0.078619) with: {'batch_size': 20, 'epochs': 100, 'init': 'glorot_uniform', 'optimizer': 'rmsprop'}
0.561462 (0.058214) with: {'batch_size': 20, 'epochs': 100, 'init': 'glorot_uniform', 'optimizer': 'adam'}
0.476744 (0.081397) with: {'batch_size': 20, 'epochs': 100, 'init': 'normal', 'optimizer': 'rmsprop'}
0.475083 (0.080802) with: {'batch_size': 20, 'epochs': 100, 'init': 'normal', 'optimizer': 'adam'}
0.578073 (0.083920) with: {'batch_size': 20, 'epochs': 100, 'init': 'uniform', 'optimizer': 'rmsprop'}
0.566445 (0.052454) with: {'batch_size': 20, 'epochs': 100, 'init': 'uniform', 'optimizer': 'adam'}
0.436877 (0.053243) with: {'batch_size': 20, 'epochs': 150, 'init': 'glorot_uniform', 'optimizer': 'rmsprop'}
0.566445 (0.052454) with: {'batch_size': 20, 'epochs': 150, 'init': 'glorot_uniform', 'optimizer': 'adam'}
0.471761 (0.079805) with: {'batch_size': 20, 'epochs': 150, 'init': 'normal', 'optimizer': 'rmsprop'}
0.438538 (0.058214) with: {'batch_size': 20, 'epochs': 150, 'init': 'normal', 'optimizer': 'adam'}
0.566445 (0.052454) with: {'batch_size': 20, 'epochs': 150, 'init': 'uniform', 'optimizer': 'rmsprop'}
0.576412 (0.042117) with: {'batch_size': 20, 'epochs': 150, 'init': 'uniform', 'optimizer': 'adam'}
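
For reference, the 54 result rows above correspond to every combination of the four lists in the parameter grid. A small sketch with itertools reproduces that count; it only enumerates the settings and does not train anything.

```python
from itertools import product

# Same candidate values as the parameter grid above
param_grid = {
    "optimizer": ["rmsprop", "adam"],
    "init": ["glorot_uniform", "normal", "uniform"],
    "epochs": [50, 100, 150],
    "batch_size": [5, 10, 20],
}

# One dict per combination, analogous to what a grid search iterates over
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(combinations))  # 2 * 3 * 3 * 3 = 54
print(combinations[0])
```

Each combination is refit once per cross-validation fold, so the number of model fits is at least 54 times the number of folds, which is why a grid like this is slow for a neural network.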

	time	production
0	1956-01-01	284
1	1956-04-01	213
2	1956-07-01	227
3	1956-10-01	308
4	1957-01-01	262

	ds	y
0	1956-01-01	284
1	1956-04-01	213
2	1956-07-01	227
3	1956-10-01	308
4	1957-01-01	262

	ds	yhat
578	2011-03-28	369.188152
579	2011-03-29	369.237345
580	2011-03-30	369.676611
581	2011-03-31	370.495194
582	2011-04-01	371.681714

	Feature	Logistic Regression Coefficient	Percent Change in Odds of Being in Labor Force
0	mtr	-1.249254	-71.33
1	huswage	-1.224260	-70.60
2	hushrs	-0.683773	-49.53
3	kidslt6	-0.652633	-47.93
4	age	-0.561213	-42.95
5	exper	0.417370	51.80
6	educ	0.398756	49.00
7	expersq	0.364379	43.96
8	huseduc	-0.223389	-20.02
9	motheduc	0.153784	16.62

Thank you so much for your time and consideration!¶

Best,¶

George John Jordan Thomas Aquinas Hayward, Optimist

	time	production
0	2010-07-01	428.722858
1	2010-10-01	481.600327
2	2011-01-01	410.967236
3	2011-04-01	379.157988

	ds	y
0	2010-07-01	387.427224
1	2010-10-01	485.300021
2	2011-01-01	417.437681
3	2011-04-01	371.681714

	inlf	hours	kidslt6	kidsge6	age	educ	wage	repwage	hushrs	husage	huseduc	huswage	faminc	mtr	motheduc	fatheduc	unem	city	exper	nwifeinc	lwage	expersq
0	1	1610	1	0	32	12	3.3540	2.65	2708	34	12	4.0288	16310	0.7215	12	7	5.0	0	14	10.910060	1.210154	196
1	1	1656	0	2	30	12	1.3889	2.65	2310	30	9	8.4416	21800	0.6615	7	7	11.0	1	5	19.499981	0.328512	25
2	1	1980	1	3	35	12	4.5455	4.04	3072	40	12	3.5807	21040	0.6915	12	7	5.0	0	15	12.039910	1.514138	225
3	1	456	0	3	34	12	1.0965	3.25	1920	53	10	3.5417	7300	0.7815	7	7	5.0	0	6	6.799996	0.092123	36
4	1	1568	1	2	31	14	4.5918	3.60	2000	32	12	10.0000	27300	0.6215	12	14	9.5	1	7	20.100060	1.524272	49

	Total	Percent
lwage	325	0.431607
wage	325	0.431607
expersq	0	0.000000
hours	0	0.000000
kidslt6	0	0.000000

	Feature	XGBoost Importance
0	exper	0.176079
1	kidslt6	0.081903
2	kidsge6	0.079420
3	mtr	0.079354
4	motheduc	0.075405
5	husage	0.074865
6	age	0.073868
7	educ	0.072686
8	huswage	0.063095
9	hushrs	0.051104